# CISC 260 Machine Organization and Assembly Language

Performance

[Adapted from Mary Jane Irwin ( <a href="www.cse.psu.edu/~mji">www.cse.psu.edu/~cg431</a> and from Computer Organization and Design, Patterson & Hennessy, © 2005, UCB]

#### **Performance Metrics**

- Purchasing perspective
  - given a collection of machines, which has the
    - best performance?
    - least cost ?
    - best performance/cost?
- Design perspective
  - faced with design options, which has the
    - best performance improvement ?
    - least cost ?
    - best performance/cost?
- Both require
  - basis for comparison
  - metric for evaluation
- Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

# Defining (Speed) Performance

- Normally interested in reducing
  - Response time (aka execution time) the time between the start and the completion of a task
    - Important to individual users
  - Thus, to maximize performance, need to minimize execution time

If X is n times faster than Y, then

- Throughput the total amount of work done in a given time
  - Important to data center managers
- Decreasing response time almost always improves throughput

### Performance Factors

- Want to distinguish elapsed time and the time spent on our task
- CPU execution time (CPU time) time the CPU spends working on a task
  - Does not include time waiting for I/O or running other programs

CPU execution time = # CPU clock cycles x clock cycle time for a program or

CPU execution time = # CPU clock cycles for a program for a program clock rate

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

#### Review: Machine Clock Rate

 Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)

$$CC = 1/CR$$



10 nsec clock cycle => 100 MHz clock rate

5 nsec clock cycle => 200 MHz clock rate

2 nsec clock cycle => 500 MHz clock rate

1 nsec clock cycle => 1 GHz clock rate

500 psec clock cycle => 2 GHz clock rate

250 psec clock cycle => 4 GHz clock rate

200 psec clock cycle => 5 GHz clock rate

Different numbers of cycles for different instructions



- Multiplication takes more time than addition
- Floating point operations take longer than that of integer
- Accessing memory takes more time than accessing registers

• Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

## Clock Cycles per Instruction

- Not all instructions take the same amount of time to execute
  - One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

```
# CPU clock cycles for a program = # Instructions x Average clock cycles per instruction
```

- Clock cycles per instruction (CPI) the average number of clock cycles each instruction takes to execute
  - A way to compare two different implementations of the same ISA

|     | CPI for this instruction class |   |   |
|-----|--------------------------------|---|---|
|     | Α                              | В | С |
| CPI | 1                              | 2 | 3 |

### **Effective CPI**

 Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

Overall effective CPI = 
$$\sum_{i=1}^{n} (CPI_i \times IC_i)$$

- Where IC<sub>i</sub> is the count (percentage) of the number of instructions of class i executed
- CPI<sub>i</sub> is the (average) number of clock cycles per instruction for that instruction class
- n is the number of instruction classes
- The overall effective CPI varies by instruction mix a measure of the dynamic frequency of instructions across one or many programs

## THE Performance Equation

Our basic performance equation is then

- These equations separate the three key factors that affect performance
  - Can measure the CPU execution time by running the program
  - The clock rate is usually given
  - Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details
  - CPI varies by instruction type and ISA implementation for which we must know the implementation details

## **Determinates of CPU Performance**

CPU time = Instruction\_count x CPI x clock\_cycle

|                        | Instruction_<br>count | CPI | clock_cycle |
|------------------------|-----------------------|-----|-------------|
| Algorithm              |                       |     |             |
| Programming language   |                       |     |             |
| Compiler               |                       |     |             |
| ISA                    |                       |     |             |
| Processor organization |                       |     |             |
| Chip                   |                       |     |             |
| Technology             |                       |     |             |

## **Determinates of CPU Performance**

CPU time = Instruction\_count x CPI x clock\_cycle

|                        | Instruction_<br>count | CPI | clock_cycle |  |
|------------------------|-----------------------|-----|-------------|--|
| Algorithm              | X                     | X   |             |  |
| Programming language   | X                     | X   |             |  |
| Compiler               | X                     | X   |             |  |
| ISA                    | X                     | X   | X           |  |
| Processor organization |                       | X   | X           |  |
| Chip<br>Technology     |                       |     | X           |  |

## A Simple Example

| Ор     | Freq | CPI <sub>i</sub> | Freq x CPI <sub>i</sub> |
|--------|------|------------------|-------------------------|
| ALU    | 50%  | 1                |                         |
| Load   | 20%  | 5                |                         |
| Store  | 10%  | 3                |                         |
| Branch | 20%  | 2                |                         |
|        |      |                  | $\Sigma =$              |

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?

## A Simple Example

| Ор     | Freq | CPI <sub>i</sub> | Freq x     | CPI <sub>i</sub> |     |
|--------|------|------------------|------------|------------------|-----|
| ALU    | 50%  | 1                |            | .5               | .5  |
| Load   | 20%  | 5                |            | 1.0              | .4  |
| Store  | 10%  | 3                |            | .3               | .3  |
| Branch | 20%  | 2                |            | .4               | .4  |
|        |      |                  | $\Sigma =$ | 2.2              | 1.6 |

| .25  | .5  | .5  |
|------|-----|-----|
| 1.0  | 1.0 | .4  |
| .3   | .3  | .3  |
| .4   | .2  | .4  |
| 1.95 | 2.0 | 1.6 |

 How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

CPU time new =  $1.6 \times IC \times CC$  so 2.2/1.6 means 37.5% faster

 How does this compare with using branch prediction to shave a cycle off the branch time?

CPU time new =  $2.0 \times IC \times CC$  so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?

CPU time new =  $1.95 \times IC \times CC$  so 2.2/1.95 means 12.8% faster

## Comparing and Summarizing Performance

- How do we summarize the performance for benchmark set with a single number?
  - The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

$$AM = 1/n \sum_{i=1}^{n} Time_i$$

- Where Time<sub>i</sub> is the execution time for the i<sup>th</sup> program of a total of n programs in the workload
- A smaller mean indicates a smaller average execution time and thus improved performance
- Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

### Amdahl's Law

- Execution time\_after = exe time\_unaffected + exe time\_affected/ speedup.
- Execution time\_before = exe\_time\_unaffected + exe\_time\_affected.
- Speedup\_overall = execution\_time\_before / execution\_time\_after.

1
= -----(1 – fraction\_enhanced) + fraction\_enhanced/ speedup

E.g., A program runs 100s (80s for multiplication and 20s for others)

How much speedup for mul is needed to speedup the program 5 times faster?

E.g., Old CPU (40% on computation, and 60% on IO). New CPU is 10x faster on computation in Web service applications. What's the overall speedup?

### Amdahl's Law

E.g., A program runs 100s (80s for multiplication and 20s for others)

How much speedup for mul is needed to speedup the program 5 times faster?

Answer: let speedup for mul be x.

time\_after = 20s = 80s / x + 20 => 80s / x = 0 => no such x possible.

E.g., Old CPU (40% on computation, and 60% on IO). New CPU is 10x faster on computation in Web service applications. What's the overall speedup?

### Amdahl's Law

E.g., Old CPU (40% on computation, and 60% on IO). New CPU is 10x faster on computation in Web service applications. What's the overall speedup?

```
Answer: fraction_enhanced = 0.4

speedup_enhanced = 10

Speedup_overall = 1 / [0.6 + 0.4/10] = 1 / 0.64 \approx 1.56.
```

## Summary: Evaluating ISAs

#### Design-time metrics:

- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?

#### Static Metrics:

— How many bytes does the program occupy in memory?

#### Dynamic Metrics:

- How many instructions are executed? How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
- How "lean" a clock is practical?

Best Metric: Time to execute the program!

depends on the instructions set, the processor organization, and compilation techniques.

